Module 02 - Python for LLM Engineering

Reading time: ~20 minutes | Level: Advanced

The Gap Between a Demo and a Product

You have called openai.chat.completions.create() before. You pasted in a prompt. The model replied. It worked.

Now imagine you are three months into building a product on top of that API call. Compare the code you started with to what it has become:

# Version 1: The demo (day 1)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content

# Version 2: The production system (month 3)
# The _-prefixed helpers are not shown here.
from collections.abc import AsyncIterator

async def generate(
    prompt: str,
    *,
    system: str | None = None,
    max_tokens: int = 2048,
    temperature: float = 0.7,
    model: str = "gpt-4o",
    user_id: str | None = None,
) -> AsyncIterator[str]:
    messages = _build_messages(system, prompt, history=await _load_history(user_id))
    _check_token_budget(messages, max_tokens, model)

    parts: list[str] = []  # accumulate the reply so it can be persisted afterwards
    async with _rate_limiter:
        async for chunk in _stream_with_retry(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
        ):
            _log_chunk(chunk, user_id)
            text = chunk.delta.content or ""
            parts.append(text)
            yield text

    await _persist_turn(user_id, prompt, "".join(parts))
    await _update_cost_ledger(user_id, _count_tokens(messages, model))

That gap is what this module teaches. Not LLM theory. Not prompt-writing tips. The engineering required to put an LLM in production and keep it there.

What Makes LLM Engineering Hard

Calling a REST API sounds trivial. But LLM APIs violate almost every assumption you have about how APIs behave.

They are stateless but your users expect state. The model has no memory. Every request must carry the full conversation history. You manage that history, decide what to include, and handle context windows that overflow.

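They are non-deterministic. With default sampling settings, the same input can produce a different output on each call, and even a temperature of 0 does not guarantee identical results. You cannot cache responses the way you cache database queries, and testing requires probabilistic approaches.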

They are slow. A GPT-4o call can take anywhere from about 2 to 30 seconds, depending largely on output length. That latency is unacceptable in a synchronous web handler. You need streaming, async, and careful UX design.

They fail in unexpected ways. Rate limits hit mid-conversation. The API returns a partial JSON object. A tool call arrives split across two stream chunks. The connection drops after 40 tokens. Each failure mode needs its own handling.
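As a rough illustration of handling one of these failure modes, the sketch below retries a call with exponential backoff and jitter. The names `TransientError` and `with_retry` are hypothetical and exist only for this example; a real client would catch whatever exceptions its SDK raises for rate limits and dropped connections.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or dropped-connection error."""


def with_retry(
    call: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on TransientError, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many clients hitting the same limit.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")


# Usage: a fake call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("simulated rate limit")
    return "ok"


result = with_retry(flaky, sleep=lambda _: None)  # skip real sleeping in the demo
```

The tenacity library listed in the prerequisites packages the same idea as a decorator, so in practice you would likely reach for that rather than a hand-rolled loop.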

They are expensive and hard to meter. Cost is per token. A single runaway prompt loop can burn hundreds of dollars in minutes. You need token budgets, cost tracking, and circuit breakers.
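One way to picture a cost circuit breaker is a small object that converts tokens to dollars and refuses any charge that would exceed a limit. This is a minimal sketch: `CostBreaker`, `BudgetExceeded`, and the price per million tokens are all made up for illustration and do not reflect any provider's actual pricing.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the configured limit."""


class CostBreaker:
    """Hypothetical spend guard: tracks dollars spent and trips at a limit."""

    def __init__(self, limit_usd: float, usd_per_million_tokens: float) -> None:
        self.limit_usd = limit_usd
        self.rate = usd_per_million_tokens
        self.spent_usd = 0.0

    def charge(self, tokens: int) -> float:
        """Record the cost of `tokens`, or raise before the limit is crossed."""
        cost = tokens / 1_000_000 * self.rate
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${cost:.4f} would exceed limit of ${self.limit_usd:.2f}"
            )
        self.spent_usd += cost
        return cost


breaker = CostBreaker(limit_usd=1.00, usd_per_million_tokens=5.00)  # made-up price
breaker.charge(150_000)  # about $0.75, within the limit
try:
    breaker.charge(100_000)  # would bring the total to about $1.25
    tripped = False
except BudgetExceeded:
    tripped = True
```

Calling `charge` with an estimated token count before the request, rather than after, is what stops a runaway loop instead of merely reporting it.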

The output is unstructured. Even with JSON mode, models occasionally produce invalid JSON, truncated arrays, or hallucinated field names. Parsing LLM output requires defensive code.
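The sketch below shows what "defensive" can mean for a single response: locate the JSON, parse it, check required fields, and drop anything unexpected. The function name `parse_model_json` and the `REQUIRED_FIELDS` schema are hypothetical, chosen only for this example.

```python
import json
from typing import Any, Optional

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "tags": list}


def parse_model_json(raw: str) -> Optional[dict[str, Any]]:
    """Return a validated dict, or None if the model output is unusable."""
    # Models sometimes wrap JSON in prose, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete object, e.g. output was truncated
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None  # missing field or wrong type
    # Keep only known fields so hallucinated ones never reach downstream code.
    return {name: data[name] for name in REQUIRED_FIELDS}


good = parse_model_json(
    'Sure! Here you go: {"title": "x", "tags": ["a"], "mood": "happy"} Hope that helps.'
)
truncated = parse_model_json('{"title": "x", "tags": ["a"')
wrong_type = parse_model_json('{"title": 5, "tags": []}')
```

Returning `None` keeps the sketch short; a production version would probably report why parsing failed so the caller can decide whether to retry. The pydantic package in this module's install list is a plausible tool for the validation step.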

This module addresses all of these problems systematically.

What You Will Build

Across six lessons, you will construct the Python infrastructure that underlies every serious LLM application:

Lesson	What You Build
01 -- Calling LLM APIs	Production API client with retry, rate limiting, cost tracking, and structured output
02 -- Streaming	Async streaming pipeline from API to FastAPI endpoint to browser
03 -- Prompt Templates	Versioned prompt system with Jinja2, validation, and injection defense
04 -- Token Counting	tiktoken-based budget manager, context truncation strategies, sliding window
05 -- Tool Use	Python function dispatcher, schema generation, multi-step agent loop
06 -- Vector Search	Embedding pipeline, FAISS index, retrieval-augmented generation (RAG)

By the end, you will have a working skeleton of a production RAG chatbot with tool use, streaming, and cost tracking.

A Map of a Production LLM System

Before diving into individual lessons, step back and see the whole picture. Here is the Python layer of a production LLM system:

Each box in this diagram corresponds to code you will write in this module. The lessons are ordered so that each one builds on the last.

Mental Models You Need

Tokens Are Not Characters

This is one of the most common mistakes beginners make. Models do not see text. They see tokens -- integer IDs representing subword units. A common word like "python" is typically a single token. A longer word like "unbelievable" might be one token or several, depending on the tokenizer. A space before a word often produces a different token than the same word without the space.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Python is unbelievable."
tokens = enc.encode(text)
print(tokens)                  # a short list of integer IDs; exact values depend on the encoding
print(len(tokens), len(text))  # token count next to the character count (23)

Why it matters:

API costs are per token, not per character
Context window limits are in tokens (128K tokens is not 128K characters)
Token counting must happen before every API call to stay within budget
Truncation must happen at token boundaries, not character boundaries
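To illustrate the last point, here is a sketch of truncating at token boundaries: encode, slice the token list, decode. `truncate_to_tokens` and `WordTokenizer` are hypothetical names. The toy tokenizer treats each whitespace-separated word as one token so the example runs without any dependencies; it is not how a real tokenizer splits text.

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Anything with encode/decode methods of this shape."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


def truncate_to_tokens(text: str, max_tokens: int, enc: Tokenizer) -> str:
    """Cut `text` to at most `max_tokens` tokens, slicing tokens, not characters."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])


class WordTokenizer:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab.append(word)
            ids.append(self.vocab.index(word))
        return ids

    def decode(self, tokens: list[int]) -> str:
        return " ".join(self.vocab[t] for t in tokens)


short = truncate_to_tokens("one two three four", 2, WordTokenizer())
unchanged = truncate_to_tokens("one two", 10, WordTokenizer())
```

A tiktoken encoding object exposes `encode` and `decode` methods of the same shape, so it should be usable in place of the toy tokenizer.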

Context Windows Are Finite Queues

Every model has a maximum context window: the total number of tokens it can process in a single call (input + output combined). As of 2025, many frontier models have windows of 128K-200K tokens, and some advertise larger ones. That sounds large until you are building a multi-document RAG system.

The mental model: the context window is a sliding window over an infinite conversation. When it fills up, you must decide what to drop. This is not automatic. You must implement the eviction policy.

The orange "Retrieved Docs" region is the one you control most directly. The retrieval strategy determines what goes here and in what priority order.
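An eviction policy can be as simple as "keep the system message, then keep the newest turns that still fit". The sketch below implements that one policy. `evict_oldest` is a hypothetical name, and the word-count function stands in for a real token counter purely to keep the example dependency-free.

```python
from typing import Callable

Message = dict[str, str]


def evict_oldest(
    messages: list[Message],
    budget: int,
    count_tokens: Callable[[str], int],
) -> list[Message]:
    """Keep system messages plus the most recent turns that fit in `budget`."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept: list[Message] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return system + kept[::-1]  # restore chronological order


def count_words(text: str) -> int:
    """Stand-in for a real token counter."""
    return len(text.split())


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
trimmed = evict_oldest(history, budget=6, count_tokens=count_words)
```

Dropping the oldest turns is only one option. Summarizing old turns or prioritizing retrieved documents are alternatives that trade more code for better recall.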

Embeddings Are Vectors in Semantic Space

An embedding model converts text into a dense vector of floating-point numbers (typically 768-3072 dimensions). Texts with similar meaning have vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A feline rested on a rug" are semantically similar -- their embeddings will have high cosine similarity.

# Conceptually (not actual embedding dimensions shown)
embedding("The cat sat on the mat")   -> [0.12, -0.34, 0.89, ...]
embedding("A feline rested on a rug") -> [0.11, -0.31, 0.91, ...]  # close
embedding("Stock prices fell 3%")     -> [-0.67, 0.22, -0.44, ...] # far
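"Close" and "far" here are usually measured with cosine similarity. The sketch below computes it for the three toy vectors above, truncated to the three components shown. Real embeddings have hundreds or thousands of dimensions, so the numbers are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for the conceptual vectors, not real model output.
cat = [0.12, -0.34, 0.89]
feline = [0.11, -0.31, 0.91]
stocks = [-0.67, 0.22, -0.44]

similar = cosine_similarity(cat, feline)    # close to 1
unrelated = cosine_similarity(cat, stocks)  # negative
```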

This is the foundation of RAG: embed your documents, store the vectors in a vector database, embed the user's query, find the nearest document vectors, and inject those documents into the LLM's context. The LLM can then ground its answer in the retrieved facts instead of relying only on what it absorbed during training.
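The retrieval and injection steps can be sketched in a few lines if you assume the embeddings already exist. In this example the vectors are hand-written toys, the "index" is a plain list scanned linearly, and `retrieve` and `build_prompt` are hypothetical names. A real system would call an embedding model and use a proper vector index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[str]:
    """Return the k document texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, docs: list[str]) -> str:
    """Inject the retrieved documents into the text sent to the model."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Pretend these vectors came from an embedding model.
index = [
    ("Cats like to sleep on rugs.", [0.11, -0.31, 0.91]),
    ("Stock prices fell 3% today.", [-0.67, 0.22, -0.44]),
    ("Kittens nap on soft mats.", [0.15, -0.30, 0.85]),
]
query_vec = [0.12, -0.34, 0.89]  # pretend embedding of the question

docs = retrieve(query_vec, index, k=2)
prompt = build_prompt("Where do cats rest?", docs)
```

The linear scan in `retrieve` is the part an index such as FAISS replaces once the document count grows.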

Streaming Is a Protocol, Not a Feature

When you enable streaming, the API does not wait for the full response before sending anything. It sends tokens as they are generated, using HTTP chunked transfer encoding or Server-Sent Events. Your Python code receives a stream of deltas -- tiny JSON objects each containing a fragment of the response.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Non-streaming: one response object, arrives after full generation
response = client.chat.completions.create(model="gpt-4o", messages=[...])
print(response.choices[0].message.content)  # The whole thing at once

# Streaming: many delta objects, arrive as they are generated
stream = client.chat.completions.create(model="gpt-4o", messages=[...], stream=True)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)  # Print each fragment immediately
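To make the "protocol" point concrete, the sketch below parses a Server-Sent Events body by hand. The payload lines are hand-written and heavily simplified, not captured from a real API, and `iter_sse_deltas` is a hypothetical helper. The `data:` prefix comes from the SSE format itself, while the `[DONE]` sentinel follows the convention used by OpenAI's chat completions stream; other providers may signal completion differently. Official SDKs do this parsing for you.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from simplified SSE lines until the stream ends."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and comments carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # the first delta often carries only the role
                yield text


# Hand-written example of what arrives on the wire, one event per "data:" line.
raw_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_sse_deltas(raw_lines))
```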

The moment you add tool use to your streaming pipeline, things get more complex: tool call arguments arrive in fragments too, and you cannot dispatch the tool until you have reassembled the complete JSON arguments.
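The reassembly step might look like the sketch below. The delta dictionaries are a simplified, hypothetical shape (an `index`, an optional `name`, and an `arguments` fragment), loosely modelled on how streamed tool calls are reported; real SDK objects carry more fields and use attribute access.

```python
import json
from typing import Any


def assemble_tool_calls(deltas: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Concatenate argument fragments per call index, then parse the JSON."""
    calls: dict[int, dict[str, str]] = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        if delta.get("name"):
            call["name"] = delta["name"]  # the name usually arrives once, up front
        call["arguments"] += delta.get("arguments", "")
    # Parse only after the stream has ended: a partial fragment is not valid JSON.
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"])}
        for _, call in sorted(calls.items())
    ]


# Simulated fragments: the arguments JSON is split in the middle of a string.
deltas = [
    {"index": 0, "name": "get_weather", "arguments": ""},
    {"index": 0, "arguments": '{"city": "Par'},
    {"index": 0, "arguments": 'is", "unit": "C"}'},
]
calls = assemble_tool_calls(deltas)
```

Keying on `index` matters because a single response can contain several tool calls whose fragments are interleaved in the stream.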

How This Module Fits the Learning Path

This module assumes you have completed:

Python Foundation: functions, classes, exceptions, file I/O
Python Intermediate: async/await, asyncio, generators, context managers, type hints
Python Advanced Module 1: scientific Python stack (NumPy, Pandas, PyTorch basics)

It feeds into:

Module 3 -- ML Engineering: building training pipelines with LLM-generated data, evaluation harnesses
Module 4 -- MLOps: deploying LLM applications, A/B testing prompts in production, model versioning

If you have completed the Python Advanced track (metaprogramming, async, performance), many patterns in this module will feel familiar. You will recognize async generators in the streaming lesson, context managers in the API client patterns, and dataclasses in the prompt template system.

Prerequisites Checklist

Before starting Lesson 01, verify you can answer these questions without looking them up:

What is async def and when do you need it instead of def?
What does await do, and what can you await?
What is AsyncIterator[str] and how do you consume one with async for?
What is a context manager (with statement) and how do __enter__ and __exit__ work?
What is a dataclass and how does it differ from a plain class?
What does @retry from the tenacity library do?

If any of these are unclear, review the relevant Python Intermediate or Advanced lessons before continuing.

Environment Setup

All lessons in this module use these packages. Install them now:

pip install anthropic openai tiktoken tenacity httpx fastapi uvicorn \
            jinja2 pydantic faiss-cpu numpy sentence-transformers python-dotenv

You will need API keys from Anthropic and/or OpenAI. Store them as environment variables:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

Or use a .env file with python-dotenv:

from dotenv import load_dotenv
load_dotenv()  # reads .env into os.environ

Tip: Never hardcode API keys in source code. Never commit them to version control. Use .env files locally and environment variables in production. A key leaked in a public GitHub repository can be found by automated scanners within minutes.

Key Takeaways

The distance between an LLM demo and a production LLM system is measured in retry logic, token budgets, streaming pipelines, and cost tracking.
Tokens are not characters. Always count tokens before calling the API, not after.
The context window is a finite resource you manage. Eviction policy is your responsibility.
Embeddings map text to vectors in semantic space. Closeness in vector space means semantic similarity.
Streaming is a chunked protocol. Tool calls in streams must be reassembled before dispatch.
This module builds the Python infrastructure for a production LLM system, lesson by lesson.

Start with Lesson 01 -- Calling LLM APIs.
